Abstract. Federated query engines allow data consumers to execute
queries over the federation of Linked Data (LD). However, because
federated queries are decomposed into potentially thousands of subqueries
distributed among SPARQL endpoints, data providers never see the
federated queries themselves; they only see the subqueries they process.
Consequently, unlike in warehousing approaches, LD data providers have
no access to secondary data. In this paper, we propose FETA (FEderated
query TrAcking), a query tracking algorithm that infers the Basic Graph
Patterns (BGPs) processed by a federation from a shared log maintained
by data providers. The concurrent execution of thousands of subqueries
generated by multiple federated query engines makes the query tracking
process challenging and uncertain. Experiments with Anapsid show that
FETA is able to extract BGPs which, even in a worst-case scenario,
contain the BGPs of the original queries.


1 Introduction 

The federation of Linked Data (LD) interlinks massive amounts of data
across the Web. Federated query engines [1, 2, 10, 12] allow data consumers to
query data residing in the federation transparently, as if they were a single
RDF graph.

A query engine splits the user’s query into subqueries distributed among SPARQL
endpoints without revealing the whole federated query. Hence, data providers do
not know the complete federated query in which they participate: they do not
know which of their data are combined, when, and by whom. Consequently, unlike
in data warehousing approaches, data providers have only partial access to
secondary data.

In this paper we propose FETA (FEderated query TrAcking), a query tracking
algorithm that computes the original federated BGPs (Basic Graph Patterns) from
shared logs maintained by data providers. The concurrent execution of thousands
of subqueries generated by multiple federated query engines makes the query
tracking process challenging and uncertain. To tackle this problem, we developed
a set of heuristics that link or unlink variables used in different subqueries
of a federated join query. We evaluated FETA on the concurrent execution of
queries of the FedBench benchmark [11]. Even in a worst-case scenario, FETA
extracts BGPs that contain the federated BGPs used in the original queries.






The paper is organized as follows: Section 2 introduces a motivating example
and describes the scientific problem. Section 3 presents FETA and its heuristics
for query deduction. Section 4 presents experimental results. Section 5 gives an
overview of related work. Finally, conclusions and future work are outlined in
Section 6.

2 Background and Motivations 

Given a SPARQL query and a federation defined as a set of SPARQL endpoints,
a federated query engine performs the following tasks: (i) query decomposition
normalizes, rewrites and simplifies the query; (ii) data localization performs
source selection among the defined federation and rewrites the query into a
distributed query; (iii) global query optimization optimizes the distributed
query by rewriting it into an equivalent distributed query using various
heuristics, e.g., minimizing intermediate results or minimizing the number of
calls to endpoints; (iv) distributed query execution executes the optimized
plan with the physical operators available in the federated query engine. In
Figure 1, we observe that two join operations will be executed in the federated
query engine with data coming from subqueries sent to SPARQL endpoints.


Query CD3 of FedBench

SELECT ?pres ?party ?page WHERE {
  ?pres rdf:type dbpedia-owl:President .
  ?pres dbpedia-owl:nationality dbpedia:United_States .
  ?pres dbpedia-owl:party ?party .
  ?x nytimes:topicPage ?page .
  ?x owl:sameAs ?pres }

Fig. 1. Query processing and FETA’s deduction for CD3.


Federated query tracking infers federated queries from a shared log maintained
by data providers. We illustrate the general process in Figure 1. A federated
query engine executes a federated query on a federation of SPARQL endpoints.
FETA collects the logs of a federation of data providers and infers the BGPs
used in federated queries. In this way, data providers that collaborate have
access to secondary data. In our example, FETA allows the NYTimes data provider
to know which of its data are used in conjunction with DBPedia data. Query
tracking can be applied to many federated query engines; in this paper we focus
on tracking queries processed by the Anapsid [1] federated query engine, with
its physical join operators, nested loop with filter options (nlfo) and
symmetric hash (symhash).


Time      Endpoint                 Subquery/Answer

11:24:19  DBPedia InstancesTypes   (Subquery 1) SELECT ?pres WHERE {
                                     ?pres rdf:type dbpedia-owl:President }

11:24:23  DBPedia InstancesTypes   (Answer) { { var: "pres", values:
                                     "http://dbpedia.org/resource/Ernesto_Samper", ...,
                                     "http://dbpedia.org/resource/Shimon_Peres", ...,
                                     "http://dbpedia.org/resource/Barack_Obama" } }

11:24:21  DBPedia InfoBox          (Subquery 2) SELECT ?party ?pres WHERE {
                                     ?pres dbpedia-owl:nationality dbpedia:United_States .
                                     ?pres dbpedia-owl:party ?party }

11:24:24  DBPedia InfoBox          (Answer) { { var: "party", values:
                                     "http://dbpedia.org/resource/Democratic_Party_%28United_States%29", ...,
                                     "http://dbpedia.org/resource/Independent_%28politics%29",
                                     "http://dbpedia.org/resource/Republican_Party_%28US%29" },
                                   { var: "pres", values:
                                     "http://dbpedia.org/resource/Barack_Obama", ...,
                                     "http://dbpedia.org/resource/Johnny_Anders", ...,
                                     "http://dbpedia.org/resource/Judith_Flanagan_Kennedy", ... } }

11:24:25  NYTimes                  (Subquery 3) SELECT ?pres ?x ?page WHERE {
                                     ?x nytimes:topicPage ?page .
                                     ?x owl:sameAs ?pres . FILTER
                                     ((?pres=<http://dbpedia.org/resource/Barack_Obama>) ||
                                      (?pres=<http://dbpedia.org/resource/Johnny_Anders>) ||
                                      (?pres=<http://dbpedia.org/resource/Judith_Flanagan_Kennedy>), ...)
                                   } LIMIT 10000 OFFSET 0

11:24:27  NYTimes                  (Answer) { { var: "pres", values:
                                     "http://dbpedia.org/resource/Barack_Obama" }
                                   { var: "x", values:
                                     "http://data.nytimes.com/47452218948077706853" }
                                   { var: "page", values:
                                     "http://topics.nytimes.com/top/reference/timestopics/people/o/barack_obama/index.html" } }


Table 1. Partial logs of DBPedia (InstancesTypes, InfoBox) and NYTimes.


Figure 1 illustrates how Anapsid processes query CD3 from FedBench¹ [11].
The goal is to find all US presidents, their party membership and pages with
news about them. Table 1 presents an extract of the federated log with traces of
the CD3 execution. It contains some subqueries and the associated answers² of
the endpoints. These traces correspond to subqueries sent by one query engine,
identified by its IP address.

1 Prefixes for queries presented in this article are:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geonames: <http://www.geonames.org/ontology#>
PREFIX linkedMDB: <http://data.linkedmdb.org/resource/movie/>
PREFIX nytimes: <http://data.nytimes.com/elements/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX purl: <http://purl.org/dc/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>


Query CD3 of FedBench

SELECT ?pres ?party ?page WHERE {
  ?pres rdf:type dbpedia-owl:President .                   (tp1)
  ?pres dbpedia-owl:nationality dbpedia:United_States .    (tp2)
  ?pres dbpedia-owl:party ?party .                         (tp3)
  ?x nytimes:topicPage ?page .                             (tp4)
  ?x owl:sameAs ?pres }                                    (tp5)

Query CD4 of FedBench

SELECT ?actor ?news WHERE {
  ?film purl:title 'Tarzan' .                              (tp1)
  ?film linkedMDB:actor ?actor .                           (tp2)
  ?actor owl:sameAs ?x .                                   (tp3)
  ?y owl:sameAs ?x .                                       (tp4)
  ?y nytimes:topicPage ?news }                             (tp5)

Fig. 2. Query processing and FETA’s deduction for CD3, CD4 concurrent execution.

In Figure 1 we see that Anapsid evaluates tp1 of CD3 individually, at
DBPedia-InstancesTypes. Subsequently, the query engine chooses an nlfo
implementation to join data retrieved from DBPedia-InfoBox and NYTimes.
Consequently, Anapsid sends tp2 ⋈ tp3 to DBPedia-InfoBox and stores the
intermediate results. Then, it calls NYTimes with several subqueries containing
these intermediate results in filter options. This is confirmed in Table 1,
where answers of subquery 2 from DBPedia-InfoBox are injected in the filter
part of subquery 3 and sent to NYTimes. Anapsid iterates until all intermediate
results are sent to NYTimes, in order to avoid reaching the endpoint’s response
limit. Finally, the results of the nlfo implementation are joined locally at
Anapsid with the results of the first triple pattern’s evaluation, using the
symhash operator.

Operator nlfo is represented by an arrow because the order of the join can be
deduced from the logs, i.e., it is possible to know in which direction the
nested loop is made. In contrast, symhash is represented by a dashed line
because it is impossible to know the order of a join made locally by the query
engine.

2 Query results are sent in JSON format.

It is clear that a single federated query can generate many subqueries sent 
to endpoints according to physical join operators. However, such behavior can 
be tracked if endpoints collaborate and BGPs from the federated query can be 
inferred. 

Federated query tracking is challenging if many federated queries having
common join conditions are executed concurrently. The most challenging case is
when, in addition, the queries are sent by the same query engine. We propose
some heuristics to separate subqueries belonging to different federated queries
sent by the same query engine. Figure 2 shows the concurrent execution of
queries CD3 and CD4 sent by query engine QE1. These queries have the variable ?x
in common. When logs are federated, this variable joins the BGPs of both queries.

Problem statement. Given a federated log containing independent subqueries,
link subqueries on their common join conditions, i.e., variable, IRI or
literal, if they participate in the same federated query, and deduce the BGPs
processed by the federation. The desired output is a set of BGPs indicating
(i) which endpoints evaluated which triple patterns, (ii) who issued the
federated query, and (iii) in which period of time the deduced BGP was processed.

At the bottom of Figures 1 and 2 are the deduced BGPs corresponding to the
queries appearing at the top of these figures. They also show which endpoints
evaluated which triple patterns, the federated query engine that issued the
query, and the time period during which the queries were processed.

3 FETA: FEderated query TrAcking

Figure 3 describes how FETA processes a federated log. A federated log is a
sequence of subqueries with their answers, as described in Table 1. The goal is
to link different subqueries participating in the same join, in order to
reconstruct federated BGPs. Ti*, Same queries and Common join condition link
the BGPs of subqueries. NLFO, Same Concept/Same As and Not Null Join verify
joins and potentially unlink BGPs.



[Workflow diagram; input: federated log (subqueries / answers)]

Fig. 3. Workflow processing of FETA’s heuristics.


Ti* identifies subqueries of the same time interval, which will be analyzed
together. For instance, all subqueries in Table 1 are captured within an
interval of a few seconds. The challenge is to choose an appropriate time
interval: a small window may separate subqueries pertaining to the same
federated query, whereas a large window may join subqueries of different
federated queries.
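
The trade-off between small and large windows can be illustrated with a minimal Python sketch. It assumes a gap-based policy (a new group starts when two consecutive entries are more than `max_gap_seconds` apart); this policy, the function names and the dictionary layout are assumptions made for illustration, not FETA's actual implementation. The timestamps are those of the three subqueries in Table 1.

```python
from datetime import datetime

def parse_time(text):
    # Log timestamps in Table 1 use the HH:MM:SS form.
    return datetime.strptime(text, "%H:%M:%S")

def group_by_time_window(entries, max_gap_seconds):
    """Split log entries into groups; a new group starts whenever the gap
    to the previous entry exceeds max_gap_seconds (assumed policy)."""
    ordered = sorted(entries, key=lambda e: e["time"])
    groups = []
    for entry in ordered:
        if groups:
            gap = (entry["time"] - groups[-1][-1]["time"]).total_seconds()
            if gap <= max_gap_seconds:
                groups[-1].append(entry)
                continue
        groups.append([entry])
    return groups

# Subquery timestamps taken from Table 1.
log = [
    {"time": parse_time("11:24:19"), "subquery": 1},
    {"time": parse_time("11:24:21"), "subquery": 2},
    {"time": parse_time("11:24:25"), "subquery": 3},
]

wide = group_by_time_window(log, max_gap_seconds=5)    # all three together
narrow = group_by_time_window(log, max_gap_seconds=3)  # subquery 3 split off
```

With the narrower window, subquery 3 is separated from subqueries 1 and 2 although all three belong to CD3, which is the failure mode described above for small windows.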

Same queries merges identical subqueries, but also subqueries differing only
in their offset values. The same query is sent twice to the same endpoint to be
sure of obtaining an answer, and to different endpoints in order to obtain
complete answers. For instance, every query in Table 1 is sent twice
consecutively to each selected endpoint. Additionally, in Figure 6 we observe
that the first two triple patterns of CD6 are evaluated separately at different
endpoints, with the aim of obtaining complete answers. Similar queries with
different offsets, on the other hand, are sent to avoid reaching the endpoint’s
response limit. For instance, the subqueries below, which are sent to Geonames
for evaluating query CD7, are merged by FETA and considered as a unique
subquery without the limit/offset part:

SELECT ?location ?news WHERE {
  ?y <http://data.nytimes.com/elements/topicPage> ?news .
  ?y <http://www.w3.org/2002/07/owl#sameAs> ?location }
LIMIT 10000 OFFSET 0

SELECT ?location ?news WHERE {
  ?y <http://data.nytimes.com/elements/topicPage> ?news .
  ?y <http://www.w3.org/2002/07/owl#sameAs> ?location }
LIMIT 10000 OFFSET 10000
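
One possible way to merge such subqueries is to normalize each query string before comparing it. The sketch below is only an illustration under stated assumptions: it strips `LIMIT`/`OFFSET` clauses with a regular expression and collapses whitespace, whereas a robust implementation would parse the SPARQL query. The names `normalize` and `merge_same_queries` are hypothetical.

```python
import re

# Matches " LIMIT <n>" and " OFFSET <n>" clauses (case-insensitive).
_PAGING = re.compile(r"\s+(LIMIT|OFFSET)\s+\d+", re.IGNORECASE)

def normalize(subquery):
    """Drop paging clauses and collapse whitespace (string-level heuristic)."""
    without_paging = _PAGING.sub("", subquery)
    return " ".join(without_paging.split())

def merge_same_queries(subqueries):
    """Group raw subqueries under their normalized form."""
    merged = {}
    for query in subqueries:
        merged.setdefault(normalize(query), []).append(query)
    return merged

body = (
    "SELECT ?location ?news WHERE {\n"
    "  ?y <http://data.nytimes.com/elements/topicPage> ?news .\n"
    "  ?y <http://www.w3.org/2002/07/owl#sameAs> ?location }\n"
)
page_1 = body + "LIMIT 10000 OFFSET 0"
page_2 = body + "LIMIT 10000 OFFSET 10000"

# page_1 is logged twice, as when a query is resent to the same endpoint.
merged = merge_same_queries([page_1, page_1, page_2])
```

All three log entries end up under a single normalized subquery, so the later heuristics see one BGP instead of three.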


Common join condition joins the BGPs of queries having common projected
variables, or triple patterns with a common IRI/literal in their subjects or
objects. We are aware that, in general, subqueries are joined on their common
projected variables. However, we also consider IRIs and literals, even if this
can introduce some noise into our deduction approach, because they are used in
some cases as a common join condition. For instance, in Figure 4 the IRI
dbpedia:Barack_Obama is a join condition between triple patterns of CD2. We
presume that subqueries with a common join condition that are close in time may
be joined locally at the query engine using the symmetric hash operator. For
instance, in Table 1 all subqueries have variable ?pres in common and thus we
suppose they are joined at Anapsid.
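
The linking step can be viewed as computing connected components over subqueries that share at least one join term. The following Python sketch uses a small union-find for this; the data layout (`id`, `join_terms`) and the function name are assumptions made for illustration. The first three entries mirror the subqueries of Table 1, the fourth is a CD4-like subquery that shares ?x with subquery 3, and the fifth is an unrelated subquery.

```python
def link_by_join_condition(subqueries):
    """Group subqueries that share a projected variable, IRI or literal,
    directly or through a chain of other subqueries."""
    parent = list(range(len(subqueries)))

    def find(i):
        # Find the group representative, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(subqueries)):
        for j in range(i + 1, len(subqueries)):
            if subqueries[i]["join_terms"] & subqueries[j]["join_terms"]:
                parent[find(i)] = find(j)

    groups = {}
    for i, subquery in enumerate(subqueries):
        groups.setdefault(find(i), []).append(subquery["id"])
    return sorted(groups.values())

subqueries = [
    {"id": "sq1", "join_terms": {"?pres"}},
    {"id": "sq2", "join_terms": {"?pres", "?party", "dbpedia:United_States"}},
    {"id": "sq3", "join_terms": {"?pres", "?x", "?page"}},
    {"id": "sq4", "join_terms": {"?x", "?actor"}},  # CD4-like, shares ?x
    {"id": "sq5", "join_terms": {"?name"}},         # unrelated
]

groups = link_by_join_condition(subqueries)
```

Here sq4 is pulled into the same group as sq1 to sq3 only because it shares ?x with sq3. This chaining effect is why the later heuristics are needed to verify and unlink joins.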

Nested Loop with Filter Options (NLFO) verifies whether BGPs joined by the
previous heuristic were executed with a nested loop operator. In particular, we
group queries varying only in their filter values; if these values are
contained in the answers of a previously evaluated subquery, we confirm that
they are joined. For instance, in Table 1 we identify that the filter values of
subquery 3 correspond to answers of variable ?pres, e.g.,
<http://dbpedia.org/resource/Barack_Obama>, for subquery 2. This certifies an
nlfo between subqueries 2 and 3, discarding a global symhash join among the
three subqueries.
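
The containment test at the heart of this heuristic can be sketched in a few lines of Python. This is a simplified illustration, not FETA's code: the function name and the answer layout (a mapping from variable name to a set of values) are assumptions, and only the values explicitly listed in Table 1 are used, since the log extract elides the others.

```python
def is_nlfo(filter_values, earlier_answers):
    """Return True if every FILTER value of a later subquery was returned
    by an earlier subquery, for any of its variables."""
    returned = set().union(*earlier_answers.values())
    return bool(filter_values) and set(filter_values) <= returned

DBR = "http://dbpedia.org/resource/"

# Values listed in Table 1 (elided values are omitted here).
answers_sq1 = {"pres": {DBR + "Ernesto_Samper", DBR + "Shimon_Peres",
                        DBR + "Barack_Obama"}}
answers_sq2 = {
    "party": {DBR + "Democratic_Party_%28United_States%29",
              DBR + "Independent_%28politics%29",
              DBR + "Republican_Party_%28US%29"},
    "pres": {DBR + "Barack_Obama", DBR + "Johnny_Anders",
             DBR + "Judith_Flanagan_Kennedy"},
}
filter_sq3 = [DBR + "Barack_Obama", DBR + "Johnny_Anders",
              DBR + "Judith_Flanagan_Kennedy"]

nlfo_2_3 = is_nlfo(filter_sq3, answers_sq2)
nlfo_1_3 = is_nlfo(filter_sq3, answers_sq1)
```

On this restricted extract, the filter values of subquery 3 are all found in the answers of subquery 2 but not in those of subquery 1, which matches the nlfo deduced between subqueries 2 and 3.
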

Same Concept/Same As verifies whether the answers of joined queries correspond
to the same concepts or to concepts related by a sameAs property³. If this is
not the case, the concerned BGPs are unlinked. For instance, in Table 1, answers
of the triple pattern of subquery 1 have the same concept as those of the second
triple pattern of subquery 3, for variable ?pres, i.e.,
<http://dbpedia.org/ontology/President>.
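
The concept comparison can be sketched as a set intersection that ignores generic concepts such as owl:Thing. The sketch below assumes that the concepts (rdf:type values) of the answers have already been looked up; how FETA obtains them is not specified here, and the sameAs part of the heuristic is not covered. The function name and the Film example are illustrative assumptions.

```python
# Generic concepts are ignored by the heuristic, e.g., owl:Thing.
GENERIC_CONCEPTS = {"http://www.w3.org/2002/07/owl#Thing"}

def share_concept(concepts_a, concepts_b, generic=GENERIC_CONCEPTS):
    """Return True if two answer sets have a non-generic concept in common."""
    return bool((set(concepts_a) & set(concepts_b)) - set(generic))

OWL_THING = "http://www.w3.org/2002/07/owl#Thing"
PRESIDENT = "http://dbpedia.org/ontology/President"
FILM = "http://dbpedia.org/ontology/Film"  # illustrative second concept

# Concepts assumed to be already resolved for the answers of each subquery.
keep_linked = share_concept({PRESIDENT, OWL_THING}, {PRESIDENT})
unlink = not share_concept({PRESIDENT, OWL_THING}, {FILM, OWL_THING})
```

Two answer sets that only share owl:Thing are treated as unrelated, so the corresponding BGPs would be unlinked.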

Not Null Join verifies whether a join returns an empty set of answers. If this
is the case, the concerned BGPs are unlinked. For instance, in Table 1, the
triple patterns in subqueries 1 and 3 have a common value for the projected
variable ?pres, i.e., <http://dbpedia.org/resource/Barack_Obama>, and therefore
they remain linked in the same BGP.
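
A minimal sketch of this check in Python is given below. It compares logged answers variable by variable, which is an assumption: a full join would compare complete solution tuples, but the log in Table 1 records values per variable. The function name and the `film:A`/`film:B` identifiers are hypothetical placeholders.

```python
def not_null_join(answers_a, answers_b):
    """Return True if two subqueries share at least one variable and have
    a common answer value for every variable they share."""
    shared_vars = set(answers_a) & set(answers_b)
    if not shared_vars:
        return False
    return all(set(answers_a[v]) & set(answers_b[v]) for v in shared_vars)

DBR = "http://dbpedia.org/resource/"

# Values listed in Table 1 for subqueries 1 and 3.
answers_sq1 = {"pres": [DBR + "Ernesto_Samper", DBR + "Shimon_Peres",
                        DBR + "Barack_Obama"]}
answers_sq3 = {
    "pres": [DBR + "Barack_Obama"],
    "x": ["http://data.nytimes.com/47452218948077706853"],
    "page": ["http://topics.nytimes.com/top/reference/timestopics/"
             "people/o/barack_obama/index.html"],
}

stay_linked = not_null_join(answers_sq1, answers_sq3)

# Hypothetical placeholder values: two queries about disjoint sets of films.
separated = not not_null_join({"film": ["film:A"]}, {"film": ["film:B"]})
```

Subqueries 1 and 3 stay linked because they share Barack_Obama for ?pres, while the two film subqueries with disjoint answers would be unlinked.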

4 Experiments 

We analyzed the collection of 7 federated queries of the Cross Domain (CD)
collection of the FedBench benchmark [11]. The datasets are those concerned by
these queries: DBPedia, Jamendo, LMDB, NYTimes, SWDF and Geonames. SPARQL
endpoints are hosted by Virtuoso OpenLink⁴ 6.1.7. We used Anapsid 2.7 as the
federated query engine, with the cache disabled. The answers of endpoints to
the subqueries they received are captured with tcpdump⁵ 4.5.1. FETA is
implemented in Java 1.7 and the collected federated log is stored in a CouchDB
database⁶.

We evaluated FETA under two configurations. In the first configuration, one
Anapsid client processes all federated queries sequentially, with a delay
between queries. In the second one, one Anapsid client processes all queries
concurrently, each in its own thread. Executing queries concurrently from
a single client is clearly a worst-case scenario for FETA because the IP address
of the client cannot be used to split the subqueries of the federated log. For
the scope of this paper, we suppose that all endpoints concerned by federated
queries share their logs.

With the first configuration, FETA correctly reconstructs all federated BGPs
of the CD collection⁷. We now focus on the second experiment. In this case,
Anapsid produces 529 subqueries⁸. The sizes of the query and answer logs are
300KB and 14MB, respectively.

Table 2 shows the number of BGPs produced after each heuristic, following
the FETA execution workflow shown in Figure 3. FETA processes this federated
log in approximately 90 seconds. Ti* is not significant here; we consider it big
enough to cover the execution of all federated queries of CD. The initial log
contains 238 SELECT subqueries; as they are unlinked, there are 238 BGPs. The
Same queries heuristic removes or merges more than 60% of subqueries and their
respective answers, producing 109 BGPs. Common join condition produces a single
BGP because of chaining among queries. NLFO confirms joins by identifying the
injection of answers from a subquery into subqueries which vary only in their
filter values. The Same Concept/Same As heuristic unlinks some joins and
certifies others. Not Null Join certifies that a symmetric hash join is
possible, because there is a non-empty intersection of answers on a common
projected variable for every two linked subqueries.


FETA Heuristic            Number of produced BGPs

Same queries              109
Common join condition     1
NLFO                      1
Same Concept/Same As      2
Not Null Join             4

Table 2. Number of BGPs produced by each heuristic.


3 Note that we do not consider generic concepts for this heuristic, e.g.,
<http://www.w3.org/2002/07/owl#Thing>.

4 http://virtuoso.openlinksw.com/

5 http://www.tcpdump.org/

6 http://couchdb.apache.org/

7 Each deduced BGP corresponds to the federated query, once simplified and
rewritten by the query engine at the query decomposition phase.

8 Note that we subsequently remove ASKs and consider only SELECT subqueries.


CD1

SELECT ?predicate ?object WHERE { [...]

CD3

SELECT ?pres ?party ?page WHERE {
  ?pres rdf:type dbpedia-owl:President .                   (tp1_CD3)
  ?pres dbpedia-owl:nationality dbpedia:United_States .    (tp2_CD3)
  ?pres dbpedia-owl:party ?party .                         (tp3_CD3)
  ?x nytimes:topicPage ?page .                             (tp4_CD3)
  ?x owl:sameAs ?pres }                                    (tp5_CD3)

Fig. 4. FETA’s deduced BGP1 for CD concurrent execution.


Figures 4-7 present the BGPs extracted by FETA for all concurrently executed
CD queries. Note that the query plans established by Anapsid for each query may
differ depending on endpoint availability and on when operators are blocked.
Ideally, FETA should reconstruct 8 BGPs: the CD collection consists of 7
queries, but CD1 is a union query normally decomposed into 2 BGPs. FETA
extracted 4 BGPs containing the 8 original BGPs. Even if this result is not
precise, the extracted BGPs with endpoints’ information give valuable
information to data providers about how their data are processed and
(potentially) joined with other endpoints. In the following paragraphs, we
explain how FETA deduces each particular BGP.


CD5

SELECT ?film ?director ?genre WHERE {
  ?film dbpedia-owl:director ?director .                (tp1_CD5)
  ?director dbpedia-owl:nationality dbpedia:Italy .     (tp2_CD5)
  ?x owl:sameAs ?film .                                 (tp3_CD5)
  ?x linkedMDB:genre ?genre }                           (tp4_CD5)

BGP2 <QE1, [11:24:24]-[11:24:30]>
  [tp2_CD5 ⋈ tp1_CD5]
  [tp4_CD5 ⋈ tp3_CD5] @LMDB
Legend: nlfo, symhash

Fig. 5. FETA’s deduced BGP2 for CD concurrent execution.


CD6

SELECT ?name ?location WHERE {
  ?artist foaf:name ?name .                                     (tp1_CD6)
  ?artist foaf:based_near ?location .                           (tp2_CD6)
  ?location geonames:parentFeature ?germany .                   (tp3_CD6)
  ?germany geonames:name "Federal Republic of Germany" }        (tp4_CD6)

[BGP3 diagram: only the labels tp1_CD6, @Jamendo and @DBPedia are legible]

Fig. 6. FETA’s deduced BGP3 for CD concurrent execution.


CD4

SELECT ?actor ?news WHERE {
  ?film purl:title 'Tarzan' .                  (tp1_CD4)
  ?film linkedMDB:actor ?actor .               (tp2_CD4)
  ?actor owl:sameAs ?x .                       (tp3_CD4)
  ?y owl:sameAs ?x .                           (tp4_CD4)
  ?y nytimes:topicPage ?news }                 (tp5_CD4)

CD7

SELECT ?location ?news WHERE {
  ?location geonames:parentFeature ?parent .   (tp1_CD7)
  ?parent geonames:name "California" .         (tp2_CD7)
  ?y owl:sameAs ?location .                    (tp3_CD7)
  ?y nytimes:topicPage ?news }                 (tp4_CD7)

BGP4 <QE1, [11:24:23]-[11:24:36]>
  (tp1_CD4 ⋈ tp2_CD4 ⋈ tp3_CD4) @LMDB
  [tp5_CD4 ⋈ tp4_CD4] @NYTimes
  [tp2_CD7 ⋈ tp1_CD7] @Geonames
  [tp4_CD7 ⋈ tp3_CD7] @NYTimes
Legend: nlfo, symhash

Fig. 7. FETA’s deduced BGP4 for CD concurrent execution.

Figure 4 describes how FETA processes federated queries CD1, CD2 and CD3.
CD1 is composed of two BGPs separated by a union, which we expected to identify
individually. In fact, these two BGPs were deduced as a single BGP because they
have a common IRI, dbpedia:Barack_Obama, and also share common answers for
both variables ?predicate and ?object, which concern Barack Obama. Next,
we observe that CD1, CD2 and CD3 were grouped in one BGP. CD1 and CD2
were not separated because of the common IRI dbpedia:Barack_Obama, which is
actually also a join condition between triple patterns of CD2. The BGPs of CD2
and CD3 were not separated because the results of CD2 are included in those of
CD3 for their common triple pattern, ?x nytimes:topicPage ?page, but also for
the other triple patterns of CD2.

In Figures 5 and 6 we can see that the BGPs of CD5 and CD6 were well
reconstructed. CD4 and CD5 are first linked in the same BGP by Same
Concept/Same As, as both concern films. Subsequently, Not Null Join separated
CD4 from CD5 on the content of the ?film variable, as CD4 concerns films related
to Tarzan while CD5 concerns films of Italian directors. Similarly, CD6 and CD7
are linked by Same Concept/Same As, as both concern locations, but they
were separated because they have no common answer for variable ?location: the
first concerns the Federal Republic of Germany and the second California, USA.

Figure 7 shows FETA’s deduced BGP grouping CD4 and CD7. CD4 and CD7, on the
other hand, share common concepts but also common answers because, with the
currently employed heuristics, we infer that these two queries share the same
triple pattern ?y nytimes:topicPage ?news.

From these experiments we conclude that (i) it is possible to reconstruct
precise federated BGPs if the federated queries are different enough, and
(ii) the reconstructed BGPs contain all original BGPs, i.e., false joins are
not deduced.


5 Related Work 

Federated query tracking is related to web tracking [8]. In web tracking, a
first-party website authorizes a third party to learn about its users.
Analogously, FETA plays the role of the third party. However, the logs
collected in federated query tracking are the result of the execution of a
physical plan in distributed query processing, compared to a simpler web
navigation flow in web tracking.

Extracting information from raw logs is traditionally a data mining process [5]
involving the following steps: (i) data selection identifies the target
dataset and the relevant attributes that will be used to derive new information;
(ii) data cleaning removes noise and outliers, transforms field values to common
units, generates new fields and finally brings the data into the structured data
schema that is used for storage, e.g., relational databases, XML; (iii) data
mining applies data analysis and discovery algorithms based on machine learning,
pattern recognition, statistics and other methods; (iv) finally, evaluation
presents the new knowledge in a form that is also understandable to the end
user, e.g., through visualization.

FETA can be located at the data mining step because it transforms raw logs
into a sequence of sets of BGPs. In the experiments, we presented one extraction
for a period of time. Repeating extractions will generate a sequence of
extracted BGPs that can be used for association rule mining or frequent pattern
detection.

6 Conclusions and future work

Federated query tracking allows data providers to access secondary data in a
Linked Data federation. We proposed FETA, a federated query tracking approach
that extracts the original federated Basic Graph Patterns from a shared log
maintained by data providers. FETA links and unlinks variables present in
different subqueries of the federated log by applying the set of heuristics
presented in this paper.

Even in a worst-case scenario with Anapsid, FETA extracts BGPs that contain
the original BGPs of federated join queries. The extracted BGPs with endpoints’
information give valuable information to data providers about which triples are
joined, when and by whom.

We think FETA opens several interesting perspectives. First, the heuristics can
be improved in many ways by making better use of the semantics of predicates
and answers. Second, we conducted experiments on one slice of the federated
log. Repeating BGP extractions on successive slices makes it possible to apply
traditional data mining techniques such as extracting frequent patterns. Third,
we limited our experiments to Anapsid. Extending them to the FedX query engine
[12] will challenge the proposed heuristics because FedX physical operators
produce slightly different query traces.